
Conversation

@Yanlilyu (Contributor) commented Jul 21, 2023

Description:
This branch adds a log-based routine to help users find the most I/O-intensive jobs. The file is job_stats.py in the PyDarshan CLI tools.
It combines the aggregated data from multiple log files into a single DataFrame, sorts the data in descending order by the column name the user supplies, and then filters the data to the first n (number_of_rows from user input) records. It returns a DataFrame containing the n most I/O-intensive jobs.
User input includes log_path, module, order_by_colname, and number_of_rows.
log_path should be a path glob pattern.
order_by_colname should be "agg_perf_by_slowest", "agg_time_by_slowest", or "total_bytes".
Example usage: $ python -m darshan job_stats darshan_logs/nonmpi_workflow/"worker*.darshan" STDIO total_bytes 5
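The combine/sort/filter pipeline described above can be sketched roughly as follows (the function and argument names here are illustrative, not the actual job_stats.py API):

```python
import pandas as pd

def top_n_jobs(per_log_dfs, order_by_colname, number_of_rows):
    # Combine the per-log statistics into one DataFrame, sort it in
    # descending order by the requested column, and keep the first n rows.
    combined = pd.concat(per_log_dfs, ignore_index=True)
    combined = combined.sort_values(by=order_by_colname, ascending=False)
    return combined.head(number_of_rows)
```

For example, three single-row DataFrames with total_bytes of 10, 30, and 20 would yield the 30- and 20-byte rows for n=2.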

@tylerjereddy (Collaborator) left a comment

I tried playing with it a bit this morning and thought I'd make a few suggestions:

I tried this command: python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/mpi_io_test_with_dxt/treddy_mpi-io-test_id4373053_6-2-60198-9815401321915095332_1.darshan STDIO total_bytes 5

And got:

Statistical data of jobs:    agg_perf_by_slowest  agg_time_by_slowest                                  category_counters  ...  unique_io_total_time_by_slowest  unique_md_only_time_by_slowest  unique_rw_only_time_by_slowest
0             16.47327             0.000094  [<cdata 'struct darshan_file_category_counters...  ...                         0.000094                             0.0                        0.000094

The columns look a bit misaligned with the data. That's an easy enough fix, I think:

--- a/darshan-util/pydarshan/darshan/cli/job_stats.py
+++ b/darshan-util/pydarshan/darshan/cli/job_stats.py
@@ -162,7 +162,7 @@ def main(args: Union[Any, None] = None):
         com_dfs = combined_dfs(list_dfs)
         combined_dfs_sort = sort_data_desc(com_dfs, order_by_colname)
         combined_dfs_selected = first_n_recs(combined_dfs_sort, n)
-        print("Statistical data of jobs:", combined_dfs_selected)
+        print("Statistical data of jobs:\n", combined_dfs_selected)
Statistical data of jobs:
    agg_perf_by_slowest  agg_time_by_slowest                                  category_counters  ...  unique_io_total_time_by_slowest  unique_md_only_time_by_slowest  unique_rw_only_time_by_slowest
0             16.47327             0.000094  [<cdata 'struct darshan_file_category_counters...  ...                         0.000094                             0.0                        0.000094

That looks better!

I also tried running on all of the logs in the logs repo, just to cause trouble:

python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/**/*.darshan STDIO total_bytes 5

And ran into an error:

usage: darshan <command> job_stats [-h] log_path module order_by_colname number_of_rows
darshan <command> job_stats: error: argument number_of_rows: invalid int value: '/Users/treddy/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_and_dxt.darshan'

Curiously, if I run on that file directly it seems to work ok, so it may be related to the ** glob pattern vs. command-line handling?
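For what it's worth, when the pattern is quoted on the command line it reaches Python verbatim, and the script can expand it itself; a minimal sketch (the pattern string is just an example):

```python
import glob

# A quoted pattern survives the shell untouched, so the script can expand
# it itself; recursive=True makes "**" match nested subdirectories.
pattern = "darshan_logs/**/*.darshan"
log_paths = sorted(glob.glob(pattern, recursive=True))
```

An unquoted pattern, by contrast, is expanded by the shell before Python ever sees it, which is the behavior observed above.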

Although I don't know Shane/your full vision for the use cases just yet, here's a suggestion to get the regression testing started:

--- /dev/null
+++ b/darshan-util/pydarshan/darshan/tests/test_job_stats.py
@@ -0,0 +1,32 @@
+import argparse
+from unittest import mock
+
+from darshan.log_utils import get_log_path
+from darshan.cli import job_stats
+import pytest
+
+
+@pytest.mark.parametrize(
+    "argv", [
+        [get_log_path("e3sm_io_heatmap_only.darshan"),
+         "STDIO",
+         "total_bytes",
+         "5"],
+    ]
+)
+def test_job_stats(argv, capsys):
+    with mock.patch("sys.argv", argv):
+        # initialize the parser
+        parser = argparse.ArgumentParser(description="")
+        # run through setup_parser()
+        job_stats.setup_parser(parser=parser)
+        # parse the input arguments
+        args = parser.parse_args(argv)
+    job_stats.main(args=args)
+    captured = capsys.readouterr()
+    assert "3.258853" in captured.out
+    # NOTE: probably much better to use i.e.,
+    # pd.read_csv or similar on the string data
+    # and reconstitute a DataFrame where you could
+    # use i.e., assert_frame_equal for floating point
+    # comparisons, etc.

Also, if printing to the terminal is the "final medium" for output, we could look at libraries like https://github.com/Textualize/rich or pandas-related things built on top of that for some nice formatting I suppose.

@shanedsnyder

I agree with Tyler's suggestions above, so if we could incorporate those I think that'd be great.

also tried running on all of the logs in the logs repo, just to cause trouble:

python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/**/*.darshan STDIO total_bytes 5

And ran into an error:

Same here. I think this is being caused by shell wildcard expansion -- your shell automatically expands the pattern above into all matching Darshan log files. The script then sees a ton of log files followed by the remaining command-line arguments, which causes the rest of the argument parsing to blow up, obviously. If you enclose the pattern in quotes (and use your actual home directory path rather than ~), it works fine, as no expansion occurs.

Thinking about it more, maybe we should just lean entirely on shell wildcard expansion and not worry about anything glob-related in the Python script. I think users will be tripped up by this too, trying to figure out the right way to express a pattern that gets through the shell all the way to Python. The script would then just accept a list of log files on the command line, with the user responsible for creating that list (via a shell glob or by specifying them manually). Any objections to that? (Sorry @Yanlilyu, I know this is what you originally started with, but this may have been a bad suggestion on my part...)

If we go that direction, I'd suggest switching the remainder of the command line arguments from positional arguments to named arguments (e.g., -n rows, -m module, etc.). Further, we should probably just give them default values so users don't have to always specify them. We could default to total_bytes, POSIX module, and all rows, for instance?
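The suggested interface could look roughly like this (flag names and defaults are illustrative, not a final design):

```python
import argparse

def setup_parser(parser):
    # Accept one or more log files (e.g. from a shell glob expansion),
    # plus optional named arguments with sensible defaults.
    parser.add_argument("log_paths", nargs="+",
                        help="Darshan log file(s) to scan")
    parser.add_argument("-m", "--module", default="POSIX",
                        help="module to report on (default: POSIX)")
    parser.add_argument("-o", "--order_by", default="total_bytes",
                        help="column to sort by (default: total_bytes)")
    parser.add_argument("-n", "--rows", type=int, default=None,
                        help="number of rows to show (default: all)")
```

With this shape, `python -m darshan job_stats *.darshan` alone would give a full report sorted by total_bytes.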

Also, if printing to the terminal is the "final medium" for output, we could look at libraries like https://github.com/Textualize/rich or pandas-related things built on top of that for some nice formatting I suppose.

I agree, this would be nice, but we can always explore this later. I think the ultimate output format for this tool remains just shell output (it's really just a quick way to scan a bunch of log files for interesting jobs, it doesn't really need a super formal output format), so maybe that would be something like Rich rather than Pandas styling?

@shanedsnyder

Oh, and BTW, after figuring out how to properly format the glob to account for all logs in the Darshan logs repo, I do run into an error that should be properly guarded against:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/__main__.py", line 3, in <module>
    main()
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/__init__.py", line 164, in main
    mod.main(args)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/job_stats.py", line 162, in main
    df_i = df_IO_data(log_paths[i], mod)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/job_stats.py", line 24, in df_IO_data
    posix_recs = report.records[mod].to_df()
KeyError: 'STDIO'

Not every log is guaranteed to have a given module included in the log file. If a module isn't found in a log file, we should assign it "N/A" values or something like that and put it at the end of the list. Does that seem reasonable?
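A minimal sketch of that guard, assuming a hypothetical per-log mapping of module name to a stats dict (stats_or_na and STAT_COLS are made-up names, not the actual job_stats.py code):

```python
import numpy as np
import pandas as pd

STAT_COLS = ["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"]

def stats_or_na(module_stats, mod):
    # If the requested module is absent from this log, return NaN
    # placeholders instead of raising a KeyError.
    if mod not in module_stats:
        return pd.DataFrame([{c: np.nan for c in STAT_COLS}])
    return pd.DataFrame([module_stats[mod]])
```

Sorting the combined frame with `sort_values(..., na_position="last")` would then keep those placeholder rows at the end of the list, as suggested.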

@shanedsnyder

One final comment after looking this over: Can we drop all other columns from the derived statistics dataframe other than the ones we are concerned with (total_bytes, agg_perf_by_slowest, agg_time_by_slowest)? I don't think they are useful to include and will help make the output more digestible.
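The trimming itself is a one-liner with pandas column selection (trim_columns is a hypothetical helper name):

```python
import pandas as pd

KEEP = ["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"]

def trim_columns(df):
    # Drop everything except the statistics columns of interest.
    return df[KEEP]
```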

@Yanlilyu Yanlilyu closed this Aug 11, 2023
@shanedsnyder

We moved this all into one PR: #954
